Based on the American television game show Let’s Make a Deal and its host, named Monty Hall
The game involves three doors
behind one of which is a car (the prize)
behind the other two are goats (not the prize)
You pick a door — say, door 1
Now, the host, who knows what’s behind the doors, opens one of the other doors — say, door 2 — which reveals a goat
The host then asks you:
Do you want to stay with door 1, or
would you like to switch to door 3?
What should you do?
Use a computational method to simulate the Monty Hall game
Code
# Simulate the Monty Hall game
doors <- c("Door 1", "Door 2", "Door 3")
win_count <- 0
n <- 10000  # Number of simulations

for (i in 1:n) {
  # Randomly place the car behind one of the doors
  car_door <- sample(doors, 1, replace = TRUE)
  # Player randomly chooses a door
  player_choice <- sample(doors, 1, replace = TRUE)
  # Check if the player wins by staying with their choice
  if (player_choice == car_door) {
    win_count <- win_count + 1
  }
}
print(paste("the Probability of Winning by NOT switching:", win_count / n))
[1] "the Probability of Winning by NOT switching: 0.3449"
Now consider the scenario of switching doors
Code
win_count_switch <- 0
for (i in 1:n) {
  # Randomly place the car behind one of the doors
  car_door <- sample(doors, 1, replace = TRUE)
  # Player randomly chooses a door
  player_choice <- sample(doors, 1, replace = TRUE)
  # Host opens a door that is not the player's choice and does not have the car
  remaining_doors <- setdiff(doors, c(player_choice, car_door))
  host_opens <- sample(remaining_doors, 1, replace = TRUE)
  # The player switches to the remaining closed door
  switch_choice <- setdiff(doors, c(player_choice, host_opens))
  # Check if the player wins by switching
  if (switch_choice == car_door) {
    win_count_switch <- win_count_switch + 1
  }
}
print(paste("the Probability of Winning by switching:", win_count_switch / n))
[1] "the Probability of Winning by switching: 0.666"
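The simulated frequencies are close to the exact values of 1/3 and 2/3. Separately from the R code above, a quick Python sketch confirms them by enumerating the nine equally likely (car location, first pick) pairs: staying wins only when the first pick was right.

```python
from itertools import product

doors = [1, 2, 3]
stay_wins = switch_wins = 0

# Enumerate the 9 equally likely (car location, first pick) pairs
for car, pick in product(doors, doors):
    if pick == car:
        stay_wins += 1    # staying wins only when the first pick was right
    else:
        switch_wins += 1  # otherwise the host's reveal forces the switch onto the car

print(stay_wins / 9)    # 1/3 ≈ 0.333
print(switch_wins / 9)  # 2/3 ≈ 0.667
```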
library(maps)

ggplot() +
  borders(database = "state") +
  geom_point(aes(x = long, y = lat, color = type, size = size),
             data = walmart, alpha = 0.6) +
  coord_quickmap() +
  scale_size_identity() +
  theme_void(base_size = 12) +  # remove all extra formatting
  labs(color = "Type")          # change the label for the legend
Plotting over time
The data set contains the opening date of each store, so we can make maps showing how Walmart expanded over time
useful to define a function that subsets the data given a specified date
Code
library(tidyverse)
library(lubridate)

walmart.map <- function(data, date, show_legend = TRUE) {
  temp <- filter(data, opendate <= date) %>%
    mutate(size = if_else(type == "Distribution \ncenter", 3, 1))
  p <- ggplot() +
    borders(database = "state") +
    geom_point(data = temp,
               aes(x = long, y = lat, color = type, size = size),
               alpha = 0.6) +
    coord_quickmap() +
    scale_size_identity() +
    theme_void(base_size = 12) +
    labs(color = "Type") +
    ggtitle(year(date))
  if (show_legend) {
    p <- p +
      theme(legend.position = "bottom", legend.direction = "horizontal") +
      guides(size = "none")  # suppress size legend, keep only color legend
  } else {
    p <- p + theme(legend.position = "none")
  }
  return(p)
}
create a map for every ten years
Code
library(grid)
library(gridExtra)
library(patchwork)

p1 <- walmart.map(walmart, as.Date("1974-12-31"), show_legend = FALSE)
p2 <- walmart.map(walmart, as.Date("1984-12-31"), show_legend = FALSE)
p3 <- walmart.map(walmart, as.Date("1994-12-31"), show_legend = FALSE)
p4 <- walmart.map(walmart, as.Date("2004-12-31"), show_legend = TRUE)

# Combine plots in a 2x2 grid with a shared legend and main title
(p1 + p2) / (p3 + p4) +
  plot_layout(guides = "collect", heights = c(1, 1, 0.2)) +
  plot_annotation(title = "Walmart Store Openings by Year",
                  caption = "") &  # optional: ensures no extra caption interferes
  theme(legend.position = "right",
        legend.direction = "vertical",
        legend.box.background = element_rect(color = "black", linewidth = 0.5),
        legend.box.margin = margin(t = 2, r = 2, b = 2, l = 2),  # space around legend
        plot.title = element_text(hjust = 0, size = 18, face = "bold",
                                  margin = margin(b = 10)))  # space below title
Animation
The gganimate package allows us to create animations within a ggplot framework
An animation is essentially a series of static images shown one after the other
use the transition_states() function to create a series of plots
plot the expansion of Walmart by year
create a new variable in the data called year
within the transition_states() function, we specify the new year variable as the states argument
specify
how long we want each state visible with the state_length argument
how long we want each transition between states to last with the transition_length argument
add the shadow_mark() function to keep the points from previous states visible, adding more points on top
Code
options(device = "cairo_png")
library(gifski)
library(gganimate)
library(lubridate)

## round down to year from opendate
walmart <- walmart %>%
  mutate(year = year(opendate),
         year = factor(year, levels = as.character(sort(unique(year)))))

## create the animation
walmart_animated <- ggplot(data = walmart) +
  borders(database = "state") +
  geom_point(aes(x = long, y = lat,
                 color = type,
                 size = ifelse(type == "Distribution \ncenter", 2, 1)),
             alpha = 0.6,  # constant alpha belongs outside aes()
             show.legend = c(color = TRUE, size = FALSE)) +
  coord_quickmap() +
  theme_void(base_size = 12) +
  transition_states(states = year,
                    transition_length = 0,  # instant transition
                    state_length = 1) +     # 1 second per year
  shadow_mark() +
  labs(title = "Year: {closest_state}") +
  scale_color_manual(values = c("Supercenter" = "blue",
                                "Walmart" = "green",
                                "Distribution \ncenter" = "red"),
                     name = "Type")
Animate the plot
Code
walmart_animated
Code
## save the animation
anim_save(filename = "walmart.gif", animation = walmart_animated,
          width = 1920, height = 1080)
Racial Segregation
In 1969, Thomas C. Schelling developed a simple but striking model of racial segregation
Schelling was awarded the 2005 Nobel Prize in Economic Sciences (joint with Robert Aumann) for his work on segregation (and other research)
the model shows how local interactions can lead to surprising aggregate outcomes
Each household has relatively mild preference for neighbors of the same race
the model predicts strongly divided neighborhoods, with high levels of segregation
extreme segregation outcomes arise even though people’s preferences are not particularly extreme
Set-up of the model
Suppose we have two types of people: white people and black people
Assume there are \(n\) of each type
These people all live on a single unit square
the location of an individual is just a point \((x,y)\), where \(0 < x, y < 1\)
the set of all points \((x,y)\) satisfying \(0 < x, y < 1\) is called the unit square (denoted by \(S\))
an individual is happy if 5 or more of her 10 nearest neighbors are of the same type
otherwise, she is unhappy
Initially, individuals are mixed together (integrated)
each individual is now given the chance to stay or move
each individual stays if they are happy and moves if they are unhappy
Simulating the model
cycling through the set of all agents, each agent is now given the chance to stay or move
import matplotlib.pyplot as plt
from random import uniform, seed
from math import sqrt
import numpy as np

class Agent:

    def __init__(self, type):
        self.type = type
        self.draw_location()

    def draw_location(self):
        self.location = uniform(0, 1), uniform(0, 1)

    def get_distance(self, other):
        "Computes the euclidean distance between self and other agent."
        a = (self.location[0] - other.location[0])**2
        b = (self.location[1] - other.location[1])**2
        return sqrt(a + b)

    def happy(self,
              agents,                # list of other agents
              num_neighbors=10,      # no. of agents viewed as neighbors
              require_same_type=3):  # how many neighbors must be same type
        """
        True if a sufficient number of nearest neighbors are of the same type.
        """
        distances = []
        # distances is a list of pairs (d, agent), where d is the distance
        # from agent to self
        for agent in agents:
            if self != agent:
                distance = self.get_distance(agent)
                distances.append((distance, agent))
        # Sort from smallest to largest, according to distance
        distances.sort()
        # Extract the neighboring agents
        neighbors = [agent for d, agent in distances[:num_neighbors]]
        # Count how many neighbors have the same type as self
        num_same_type = sum(self.type == agent.type for agent in neighbors)
        return num_same_type >= require_same_type

    def update(self, agents, num_neighbors=10, require_same_type=3):
        "If not happy, then randomly choose new locations until happy."
        while not self.happy(agents, num_neighbors, require_same_type):
            self.draw_location()
write a plotting function with white individuals represented by orange dots and black ones represented by green dots
Code
def collect_distribution(agents):
    """Collect x, y coordinates for each agent type."""
    x_values_0, y_values_0 = [], []
    x_values_1, y_values_1 = [], []
    for agent in agents:
        x, y = agent.location
        if agent.type == 0:
            x_values_0.append(x)
            y_values_0.append(y)
        else:
            x_values_1.append(x)
            y_values_1.append(y)
    return x_values_0, y_values_0, x_values_1, y_values_1

def plot_side_by_side(initial_data, final_data, cycle_num):
    """Plot initial and final distributions side by side."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    plot_args = {'markersize': 8, 'alpha': 0.6}

    # Initial distribution
    ax1.set_facecolor('azure')
    ax1.plot(initial_data[0], initial_data[1], 'o',
             markerfacecolor='orange', **plot_args)
    ax1.plot(initial_data[2], initial_data[3], 'o',
             markerfacecolor='green', **plot_args)
    ax1.set_title('Initial Distribution (Cycle 0)')
    ax1.set_xlabel('X')
    ax1.set_ylabel('Y')

    # Final distribution
    ax2.set_facecolor('azure')
    ax2.plot(final_data[0], final_data[1], 'o',
             markerfacecolor='orange', **plot_args)
    ax2.plot(final_data[2], final_data[3], 'o',
             markerfacecolor='green', **plot_args)
    ax2.set_title(f'Final Distribution (Cycle {cycle_num - 1})')
    ax2.set_xlabel('X')
    ax2.set_ylabel('Y')

    plt.tight_layout()
    plt.show()
Now we can run the simulation
Code
def run_simulation(num_of_type_0=1000,
                   num_of_type_1=1000,
                   max_iter=100000,  # maximum number of iterations
                   set_seed=123,
                   num_neighbors=10,
                   require_same_type=5):

    # Set the seed for reproducibility
    seed(set_seed)

    # Create a list of agents of type 0
    agents = [Agent(0) for _ in range(num_of_type_0)]
    # Append a list of agents of type 1
    agents.extend(Agent(1) for _ in range(num_of_type_1))

    # Collect initial distribution
    initial_data = collect_distribution(agents)

    # Initialize a counter
    count = 1

    # Loop until no agent wishes to move
    while count < max_iter:
        print(f'Entering loop {count}')
        count += 1
        no_one_moved = True
        for agent in agents:
            old_location = agent.location
            agent.update(agents, num_neighbors, require_same_type)
            if agent.location != old_location:
                no_one_moved = False
        if no_one_moved:
            break

    # Collect final distribution
    final_data = collect_distribution(agents)

    # Plot initial and final distributions side by side
    plot_side_by_side(initial_data, final_data, count)

    if count < max_iter:
        print(f'Converged after {count} iterations.')
    else:
        print('Hit iteration bound and terminated.')
try running the simulation with different numbers of agents
Code
# Run the simulation with example parameters
run_simulation(num_of_type_0=1000,
               num_of_type_1=1000,
               max_iter=100000,
               set_seed=123,
               num_neighbors=10,
               require_same_type=5)  # example: require 5 neighbors of the same type
The O-ring theory of economic development was proposed in 1993 by Michael Kremer, who was awarded the 2019 Nobel Prize in Economic Sciences for his work on economic growth and development (along with Esther Duflo and Abhijit Banerjee)
The theory is based on the idea that production is multiplicative across tasks, requiring high-quality inputs working together
The theory is named after the O-ring, a critical component in the Space Shuttle Challenger that failed in 1986, leading to the shuttle’s explosion shortly after launch
firms combine inputs (e.g., workers’ skills) in a way that the failure or lower quality of one input significantly reduces output
The theory suggests that countries with low average skill levels may not be able to produce high-quality goods, as the risk of failure in one task can lead to overall poor performance
As with the O-rings on a spacecraft, one weak link can cause catastrophic failure
this leads to assortative matching, where high-skill workers cluster together, amplifying income disparities
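Kremer's production function multiplies the skill levels of all workers in a firm, so one weak link drags down the whole product. A minimal sketch (the scale factor B and the skill values below are illustrative, not taken from the simulation that follows) shows why this multiplicative form makes assortative matching more productive than mixing skill levels across firms:

```python
import numpy as np

def oring_output(skills, B=1.0):
    """O-ring production: output is the product of all workers' skill levels."""
    return B * np.prod(skills)

skills = np.array([0.9, 0.9, 0.9, 0.5, 0.5, 0.5])  # three high-skill, three low-skill workers
sorted_skills = np.sort(skills)

# Assortative matching: like-skilled workers grouped into the same firm
assortative = oring_output(sorted_skills[:3]) + oring_output(sorted_skills[3:])

# Mixed matching: high- and low-skill workers share firms
mixed = oring_output([0.9, 0.9, 0.5]) + oring_output([0.9, 0.5, 0.5])

print(assortative)  # 0.5^3 + 0.9^3 ≈ 0.854
print(mixed)        # 0.405 + 0.225 ≈ 0.630
```

Total output is higher when high-skill workers cluster together, which is the mechanism behind the wage clustering and inequality the simulation below illustrates.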
Simulating the O-ring theory
models workers with slightly skewed ability distributions
assigns them to firms based on skill levels
calculates output using an O-ring production function
Goal: The simulation shows how even modest ability differences can lead to significant income inequality
Some settings
Code
import numpy as np
import matplotlib.pyplot as plt

# Parameters
n_workers = 1000  # Number of workers
n_firms = 100     # Number of firms
workers_per_firm = n_workers // n_firms
alpha = 0.8       # O-ring production function parameter (0 < alpha < 1)
sigma = 0.2       # Standard deviation for ability distribution (small skew)
Key Steps
Step 1: Generate worker abilities (log-normal for slight skew)
Step 5: Calculate income inequality (Gini coefficient)
Code
def gini_coefficient(incomes):
    n = len(incomes)
    sorted_incomes = np.sort(incomes)
    cumulative_income = np.cumsum(sorted_incomes)
    # Lorenz curve area
    lorenz_area = np.sum(cumulative_income) / (n * cumulative_income[-1])
    # Gini = (area between line of equality and Lorenz curve) / (area under line of equality)
    gini = 1 - 2 * lorenz_area
    return gini

gini = gini_coefficient(wages)
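As a quick sanity check (not part of the original code, and restating the function so the snippet stands alone): the Gini coefficient should be near 0 when everyone earns the same and approach 1 when one person earns everything.

```python
import numpy as np

def gini_coefficient(incomes):
    # Same Lorenz-curve construction as in the simulation
    n = len(incomes)
    sorted_incomes = np.sort(incomes)
    cumulative_income = np.cumsum(sorted_incomes)
    lorenz_area = np.sum(cumulative_income) / (n * cumulative_income[-1])
    return 1 - 2 * lorenz_area

equal = np.ones(1000)      # everyone earns the same
extreme = np.zeros(1000)
extreme[-1] = 1.0          # one person earns everything

print(gini_coefficient(equal))    # ≈ 0 (exactly -1/n with this discrete formula)
print(gini_coefficient(extreme))  # ≈ 1
```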
Step 6: Plot the distribution of wages and display the Gini coefficient
Code
import seaborn as sns
import matplotlib.pyplot as plt

# Set a modern style
sns.set_style("whitegrid")             # clean grid background
plt.rcParams['font.family'] = 'Arial'  # clean, professional font
plt.rcParams['axes.titlesize'] = 14
plt.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10

# Create figure with a larger size for better visibility
plt.figure(figsize=(14, 6), facecolor='white')

# Plot ability distribution
plt.subplot(1, 2, 1)
sns.histplot(abilities, bins=50, kde=True, color='#1f77b4', alpha=0.7,
             stat='density', edgecolor='black', linewidth=0.5)
plt.title(f'Ability Distribution\n(Mean={np.mean(abilities):.2f}, '
          f'Std={np.std(abilities):.2f})',
          fontsize=14, fontweight='bold', pad=15)
plt.xlabel('Ability', fontsize=12)
plt.ylabel('Density', fontsize=12)

# Add mean line
plt.axvline(np.mean(abilities), color='red', linestyle='--',
            linewidth=2, label='Mean')
plt.legend()

# Plot wage distribution
plt.subplot(1, 2, 2)
sns.histplot(wages, bins=50, kde=True, color='#2ca02c', alpha=0.7,
             stat='density', edgecolor='black', linewidth=0.5)
plt.title(f'Wage Distribution\n(Gini Coefficient={gini:.3f})',
          fontsize=14, fontweight='bold', pad=15)
plt.xlabel('Wage', fontsize=12)
plt.ylabel('Density', fontsize=12)

# Add median line
plt.axvline(np.median(wages), color='purple', linestyle='--',
            linewidth=2, label='Median')
plt.legend()

# Adjust layout and add a subtle background color
plt.tight_layout()
plt.suptitle('Distribution Analysis', fontsize=16, fontweight='bold', y=1.05)
plt.gcf().set_facecolor('#f5f5f5')  # light gray background for the figure

# Show plot
plt.show()
library(tidyverse)

## load the data
follow <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/FOSS/refs/heads/main/twitter-following.csv")
senator <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/FOSS/refs/heads/main/twitter-senator.csv")

## examine
head(follow)
# A tibble: 6 × 2
following followed
<chr> <chr>
1 SenAlexander RoyBlunt
2 SenAlexander SenatorBurr
3 SenAlexander JohnBoozman
4 SenAlexander SenJohnBarrasso
5 SenAlexander SenBennetCO
6 SenAlexander SenDanCoats
Code
head(senator)
# A tibble: 6 × 4
screen_name name party state
<chr> <chr> <chr> <chr>
1 SenAlexander Lamar Alexander R TN
2 RoyBlunt Roy Blunt R MO
3 SenatorBoxer Barbara Boxer D CA
4 SenSherrodBrown Sherrod Brown D OH
5 SenatorBurr Richard Burr R NC
6 SenatorBaldwin Tammy Baldwin D WI
# A tibble: 3 × 5
name party state indegree outdegree
<chr> <chr> <chr> <dbl> <dbl>
1 Tom Cotton R AR 64 15
2 Richard J. Durbin D IL 60 87
3 John Barrasso R WY 58 79
Code
## with slice_max
slice_max(senator, order_by = indegree, n = 3) %>%
  arrange(desc(indegree)) %>%
  select(name, party, state, indegree, outdegree)
# A tibble: 5 × 5
name party state indegree outdegree
<chr> <chr> <chr> <dbl> <dbl>
1 Tom Cotton R AR 64 15
2 Richard J. Durbin D IL 60 87
3 John Barrasso R WY 58 79
4 Joe Donnelly D IN 58 9
5 Orrin G. Hatch R UT 58 50
senator %>%
  mutate(betweenness_dir = betweenness(twitter_adj, directed = TRUE),
         betweenness_undir = betweenness(twitter_adj, directed = FALSE)) %>%
  ggplot(aes(x = betweenness_dir, y = betweenness_undir,
             color = party, shape = party)) +
  geom_point() +
  scale_color_parties +
  scale_shape_parties +
  labs(title = "Betweenness", x = "Directed", y = "Undirected") +
  theme_classic(base_size = 22)
Calculate the pagerank
Code
senator <- mutate(senator,
                  page_rank = page_rank(twitter_adj)[["vector"]])

## Create igraph object
net <- graph_from_data_frame(d = follow, vertices = senator, directed = TRUE)

## View the new object
net
IGRAPH ba93275 DN-- 91 3859 --
+ attr: name (v/c), party (v/c), state (v/c), indegree (v/n), outdegree
| (v/n), page_rank (v/n)
+ edges from ba93275 (vertex names):
[1] Lamar Alexander->Roy Blunt Lamar Alexander->Richard Burr
[3] Lamar Alexander->John Boozman Lamar Alexander->John Barrasso
[5] Lamar Alexander->Michael F. Bennet Lamar Alexander->Daniel Coats
[7] Lamar Alexander->Susan M. Collins Lamar Alexander->John Cornyn
[9] Lamar Alexander->Bob Corker Lamar Alexander->Michael B. Enzi
[11] Lamar Alexander->Joni Ernst Lamar Alexander->Chuck Grassley
[13] Lamar Alexander->Cory Gardner Lamar Alexander->Orrin G. Hatch
+ ... omitted several edges
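igraph's page_rank() implements the PageRank algorithm: a senator's score is high when many high-scoring senators follow them. The idea can be sketched as power iteration in Python (the 3-node edge list and the damping factor 0.85 are illustrative assumptions, not the senator data):

```python
import numpy as np

def pagerank(edges, n, d=0.85, iters=100):
    """Power-iteration PageRank on a directed edge list of (i, j) pairs, i -> j."""
    # Column-stochastic transition matrix: M[j, i] = 1/outdeg(i) if i -> j
    M = np.zeros((n, n))
    out_deg = np.zeros(n)
    for i, j in edges:
        out_deg[i] += 1
    for i, j in edges:
        M[j, i] = 1.0 / out_deg[i]
    # Start from the uniform distribution and iterate
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * M @ r
    return r

# Toy network: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
r = pagerank([(0, 1), (0, 2), (1, 2), (2, 0)], n=3)
print(r)  # node 2 ranks highest: it receives links from both other nodes
```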
Code
## look at some network nodes and node (vertex) attributes
head(V(net))  ## V() for vertices (nodes)
+ 6/91 vertices, named, from ba95838:
[1] Lamar Alexander Roy Blunt Barbara Boxer Sherrod Brown
[5] Richard Burr Tammy Baldwin
Code
head(V(net)$party)
[1] "R" "R" "D" "D" "R" "D"
Plot the network
Code
# Identify top 5 senators by PageRank for each party
top5_by_party <- senator %>%
  group_by(party) %>%
  top_n(5, page_rank) %>%
  arrange(party, desc(page_rank)) %>%
  select(name, party, page_rank)

# Create a vector of labels, defaulting to NA
V(net)$label <- NA

# Assign labels for the top 5 senators per party by matching
# senator names from top5_by_party to vertex names in the network
for (i in 1:nrow(top5_by_party)) {
  vertex_idx <- which(V(net)$name == top5_by_party$name[i])
  if (length(vertex_idx) > 0) {
    V(net)$label[vertex_idx] <- top5_by_party$name[i]
  }
}

## Vector of colors of the nodes based on party
col <- senator %>%
  mutate(col = case_when(party == "R" ~ "red",
                         party == "D" ~ "blue",
                         TRUE ~ "black")) %>%
  pull(col)

# Plot the network with labels for top 5 per party
plot(net,
     vertex.size = V(net)$page_rank * 1000,
     vertex.label = V(net)$label,
     vertex.label.cex = 0.8,
     vertex.label.color = "black",
     vertex.color = col,
     edge.arrow.size = 0.1,
     edge.width = 0.5,
     main = "Twitter Network of US Senators (Top 5 PageRank Labeled)",
     layout = layout_with_kk(net, niter = 1000))

# Add a legend for party colors
legend("topright",
       legend = c("Republican", "Democrat", "Other"),
       col = c("red", "blue", "black"),
       pch = 19,   # solid circle symbol for nodes
       bty = "n",  # no box around legend
       cex = 0.8)
Social Network Analysis